Process Migration and Fault Tolerance of BSPlib Programs Running on Networks of Workstations

Authors

  • Jonathan M. D. Hill
  • Stephen R. Donaldson
  • Tim Lanfear
Abstract

This paper describes a system that enables parallel programs written using the BSPlib communications library to migrate processes among a network of workstations. The system provides fault tolerance for BSPlib jobs and, by using a load manager that maintains an approximation of the global load of the system, can continually schedule the migration of BSP processes onto the least loaded machines in a network. Results for an industrial electromagnetics application show that throughput on a publicly available collection of workstations is similar to that on a dedicated network of workstations (NOW).
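The scheduling idea in the abstract, moving processes onto the least loaded machines according to an approximate view of global load, can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names `pick_hosts` and `plan_migrations`, the load figures, and the one-process-per-host placement are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def pick_hosts(loads: Dict[str, float], nprocs: int) -> List[str]:
    """Return the nprocs hosts with the lowest approximate load (hypothetical helper)."""
    if nprocs > len(loads):
        raise ValueError("not enough hosts for the requested processes")
    # Sort by load, breaking ties by host name so the result is deterministic.
    return sorted(loads, key=lambda h: (loads[h], h))[:nprocs]


def plan_migrations(placement: List[str], loads: Dict[str, float]) -> Dict[int, str]:
    """Map process id -> new host for each process not already on a target host."""
    targets = pick_hosts(loads, len(placement))
    # Target hosts that are already occupied by a process keep that process.
    staying = [h for h in placement if h in targets]
    free = [h for h in targets if h not in staying]
    moves: Dict[int, str] = {}
    for pid, host in enumerate(placement):
        if host not in targets:
            moves[pid] = free.pop(0)
    return moves


# Assumed load snapshot: host name -> approximate load average.
loads = {"a": 0.1, "b": 2.0, "c": 0.3, "d": 1.5}
# Process 0 runs on the busy host "b", process 1 on the lightly loaded host "c".
print(plan_migrations(["b", "c"], loads))
```

In this sketch only process 0 is moved, because its host is no longer among the least loaded ones; a real system would also have to decide when migration is safe and how process state is transferred, which the sketch does not model.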

Similar resources

Improving the Performance of Coordinated Checkpointers on Networks of Workstations using RAID Techniques

Coordinated checkpointing systems are popular and general-purpose tools for implementing process migration, coarse-grained job swapping, and fault-tolerance on networks of workstations. Though simple in concept, there are several design decisions concerning the placement of checkpoint files that can impact the performance and functionality of coordinated checkpointers. Although several such che...

Analysing an SQL Application with a BSPlib Call-Graph Profiling Tool

This paper illustrates the use of a post-mortem call-graph profiling tool in the analysis of an SQL query processing application written using BSPlib [4]. Unlike other parallel profiling tools, the architecture independent metric of imbalance in size of communicated data is used to guide program optimisation. We show that by using this metric, BSPlib programs can be optimised in a portable and ...

Transparent Fault Tolerance for Parallel Applications on Networks of Workstations

This paper describes a new method for providing transparent fault tolerance for parallel applications on a network of workstations. We have designed our method in the context of a shared object system called SAM, a portable run-time system which provides a global name space and automatic caching of shared data. SAM incorporates a novel design intended to address the problem of the high communicat...

Mechanism for Implementation of Load Balancing using Process Migration

The feature of load sharing or load balancing involves migration of running processes from highly loaded workstations of a network to the lightly-loaded or idle workstations of the network. This paper describes load balancing techniques to share the workload of the workstations belonging to a particular network to gain better performance from the overall network. The mechanisms of load informat...

Executing multithreaded programs efficiently

This thesis presents the theory, design, and implementation of Cilk (pronounced “silk”) and Cilk-NOW. Cilk is a C-based language and portable runtime system for programming and executing multithreaded parallel programs. Cilk-NOW is an implementation of the Cilk runtime system that transparently manages resources for parallel programs running on a network of workstations. Cilk is built around a ...

Publication date: 1998